





# Design and Analysis of Operating Systems CSCI 3753

**Memory Management** 

Dr. David Knox
University of
Colorado Boulder

Material adapted from: Operating Systems: A Modern Perspective : Copyright © 2004 Pearson Education, Inc.

## Memory Management Paging

### **Fragmentation**

- over time, repeated allocation and deallocation will cause many small chunks of non-contiguous unallocated memory to form between allocated processes in memory
- resulting in external fragmentation
- OS must swap out current processes to create a large enough contiguous unallocated memory
- could use de-fragmentation,but it is costly to move memory

OS unallocated P1 unallocated **P**3 unallocated

RAM

 A better solution to external fragmentation is to divide the logical address space into fixed-size pages

Logical Address
Space

page 0

page 1

page 2

page 3

page 4

We can also break main memory into fixed-sized frames

Each page can be located in any frame

The frames used by a process don't have to be contiguous in memory

Need a method to map from logical address to physical address frame # page 4 0page 0 unallocated 2 3 page 2 4 page 1 5 unallocated 6 unallocated page 3

RAM



OS maintains a page table for each process

Given a logical address, MMU finds its logical page, then looks up physical frame in *page table* 

Logical Address
Space

page 0

page 1

page 2

page 3

page 4



**RAM** 

page 4

frame

Logical Address
Space

page 0

page 1

page 2

page 3

page 4

#### User's view of memory is still as one contiguous block of logical address space

 MMU performs run-time mapping of each logical address to a physical address using the page table

#### Typical page size is 4-8 KB

- Page Table Size: a 4 GB 32-bit address space with 4 KB/page  $(2^{12}) \Rightarrow 2^{32}/2^{12} = 1$  million entries in page table
  - Your page table would need at least 20 bits per table entry (~ 3 MB per process)
- Address Limitations:
  - $2^{32}$  bytes = 4 GB of memory
  - 2<sup>44</sup> bytes = 16 TB (32 bit table entry \* 12 bits per page)
  - 2<sup>64</sup> bytes = 16 Exabytes
- There is support for huge/large pages ~ 4 MB each

No external fragmentation



But we do get some internal fragmentation

- example: suppose my process is size 4100 B, and each page size is 4 KB (4096 B)
  - then I have to allocate two pages = 8 KB,
  - 4092 B of 2nd page is wasted due to fragmentation that is internal to a page

 OS has to maintain a frame table/pool that keeps track of what physical memory frames are free

Conceptually, every logical/virtual address can now be divided into two parts:

#### logical page and page offset

- most significant bits = logical page # l,
  - Same as the virtual address / page size
  - Logical page is used to index into page table to retrieve the corresponding physical frame f
- least significant bits of the address represent the page offset d





### Implementing a page table



- Option #1: use dedicated bank of hardware registers or memory to store the page table
  - fast per-instruction translation
  - slow per context switch entire page table has to be reloaded
  - limited by cost (expensive hardware) to being too small some page tables can be large, e.g. 1 million entries – too expensive

### Implementing a page table



- Option #2: store the page table in main memory
  - keep a pointer to the page table in a special CPU register called the Page Table Base Register (PTBR)
  - can accommodate fairly large page tables
  - fast context switch only reload the PTBR!
  - slow per-instruction translation, because each instruction fetch requires two steps:
    - finding the page table in memory and indexing to the appropriate spot to retrieve the physical frame # f
    - 2. retrieving the instruction from physical memory frame f

### **Paging with PTBR**

**RAM** 

- slow per-instruction translation, because each instruction fetch requires two steps:
- finding the page table in memory and indexing to the appropriate spot to retrieve the physical frame # f
- **2.** retrieving the instruction from physical memory frame f



### Implementing a page table

- Option #3: cache a subset of page table mappings/entries in a small set of CPU buffers called Translation-Look-aside Buffers (TLBs)
  - Fast solution to option #2's slow two-step memory access
  - Several TLB caching policies:
    - cache the most popular or frequently referenced pages in TLB
    - cache the most recently used pages

### Paging with TLB and PTBR

Translation-Look-aside Buffers



**RAM** 

- MMU in CPU first looks in TLB's to find a match for a given logical address
  - 1. if match found, then quickly call main memory with physical address frame f (plus offset d)
    - this is called a TLB hit
    - TLB as implemented in hardware does a fast parallel match of the input page to all stored values in the cache - about 10% overhead in speed

(continued next slide)

#### (continued from previous slide)

- 2. if no match found, then this is a *TLB miss* 
  - a) go through regular two-step lookup procedure: go to main memory to find page table and index into it to retrieve frame #f, then retrieve what's stored at address <f,d> in physical memory
  - b) Update TLB cache with the new entry from the page table
    - if cache full, then implement a cache replacement strategy, e.g. Least Recently Used (LRU) - we'll see this later

### Goal is to maximize TLB hits and minimize TLB misses

- On a context switch, the TLB entries would typically have to all be invalidated/completely flushed
  - since different processes have different page tables
  - x86 behavior behaves likes this
- An alternative is to include process IDs in TLB
  - at the additional cost of hardware and an additional comparison per lookup
  - Only TLB entries with a process ID matching the current task are considered valid
  - See DEC RISC Alpha CPU

- Another option: prevent frequently used pages from being automatically invalidated in the TLBs on a task switch
  - In Intel Pentium Pro, use the page global enable (PGE) flag in the register CR4 and the global (G) flag of a pagedirectory or page-table entry
- ARM allows flushing of individual entries from the TLB indexed by virtual address

- How does MMU interact with L1 or L2 data or instruction caches?
  - It depends on whether the items in a cache are viewed as virtual or physical
- L1/L2 data/instruction caches can store their information and be indexed by either virtual or physical addresses
  - If physical, then MMU must first convert virtual to physical, before the cache can be consulted – this is slow, but each entry is uniquely identifiable by its physical address
  - If virtual, then cache can be consulted quickly to see if there's a hit without invoking the MMU (if miss, then MMU must still be invoked...)

#### Paging with TLB and L1/L2 Caching



#### From prior figure:

- The desired logical address is first searched for in a virtually indexed L1 or L2 cache if such a cache is provided by the hardware designer.
- If a match is found, then CPU uses this cached item immediately.
- Otherwise, the MMU must be invoked to convert the logical to physical address as described earlier.
  - TLB caching improves this speed of translation

#### From prior figure:

- After the MMU has produced the desired physical address, then a physically indexed L1 or L2 cache is searched for the desired physical address, if such a cache is provided by the hardware designer.
- If a match is found, then CPU uses this cached item immediately.
- Otherwise, must fetch item from main memory.

- A virtually indexed L1/L2 data/instruction caches introduces some problems:
  - Homonym problem: when a new process is switched in, it may use the same virtual address V as the previous process.
    - The cache that indexes just by virtual address V will return the wrong information (cached information from the prior process).

#### Some solutions to the homonym problem:

- Flush the cache on each context switch
  - process gets entire cache to itself, but have to rebuild cache
- Add an address space id (process id) to each entry of the cache
  - so only data/instructions for the right process are returned for a given virtual address V
  - requires hardware support and an extra comparison
  - reduces available cache space for each process, since it has to be shared
- Each process uses non-overlapping virtual addresses in its address space
  - unlikely, violates model that each process is compiled & executes independently in its own address space [0,MAX]

- In practice,
  - Most L1 caches are virtually indexed fast
  - Most L2 caches are physically indexed
    - Each entry is unique
    - No collisions
    - This is good for code/data from shared library pages,
       i.e. if multiple processes share the same code/data,
       then it just has to be stored once in cache
  - Thus in the figure, the virtually indexed cache is essentially a small L1 cache, and the physically indexed cache is a much larger L2 cache.

#### Paging with TLB and L1/L2 Caching



 Can share code & data pages between processes by having entries in different process' page tables point to the same physical page/frame(s)

#### Sharing data:

- Two or more processes may want to share memory between them, so pointing multiple page tables to the same data pages is a way to implement shared memory
- Shared data should be protected by synchronization

#### Sharing code:

- Fork()' ing a child process causes the child to have a copy of the entire address space of the parent, including code
  - Rather than duplicating all such code pages, can simply map the child's page table to point to the same set of code pages as the parent.
  - This is a way to implement copy-on-write
- May want to share dynamically linked libraries (image, C, ...) between multiple processes
  - so map each process' page table to point to the same dll pages in memory
- Shared code should be thread-safe and reentrant

Page tables can point to the same memory frames



Can share code & data pages between processes by having entries in different process' page tables point to the same RAM physical page/frame(s) data1 data2 P1's logical P1's page address space table A2 P2's logical P2's page 0 address space table lib1 0 A2 **A1** 4 **A1** 2 3 lib1 lib1 5 lib2 5 5 lib2 lib2 3 0 В data1 data1

Shared code should be thread-safe and reentrant

#### **Sharing data:**

Two or more processes may want to share memory between them, so pointing multiple page tables to the

same data pages is a way to implement shared memory



RAM

- Fork()' ing a child process causes the child to have a copy of the entire address space of the parent, including code
  - Rather than duplicating all such code pages, can simply map the child's page table to point to the same set of code pages as the parent. 0
  - This is a way to implement copy-on-write





RAM

data1

A2

lib1

**A1** 

lib2

### **Hierarchical Paging**

- Problem with page tables: they can get very large
  - example: 32-bit address space with 4 KB/page ( $2^{12}$ ) implies that there are  $2^{32}/2^{12} = 1$  million entries in page table per process. if each entry is 4 bytes that is 4 MB.
  - it's hard to find contiguous allocation of at least 4 MB for each page table
- Solution: page the page table! this is an example of hierarchical paging.
  - Just as processes are split into pages to fit into a memory, split page table into pages to fit into memory
  - This is an example of 2-level paging. In general, we can apply N-level paging.

### **Hierarchical Paging**

- A logical address (on 32-bit machine with 1K page size) is divided into:
  - a page number consisting of 22 bits
  - a page offset consisting of 10 bits

| page number |       |          | page offset |
|-------------|-------|----------|-------------|
|             | $p_1$ | $\rho_2$ | d           |
|             | 12    | 10       | 10          |



- Since the page table is paged, the page number is further divided into:
  - a 12-bit page table number (outer page)
  - a 10-bit page table offset

### **RAM Hierarchical Paging** Hierarchical (2-level) paging: outer page table tells where each part of the page table is stored, i.e. indexes into the page table page table outer page table dotted arrow means "go to memory and retrieve the frame pointed to by the table entry"

### **Hierarchical Paging**

Hierarchical (2-level) paging divides the logical address into 3 parts:



### **Hierarchical Paging**

University of Colorado, Boulder

- While hierarchical page tables enable a large page table to be split to fit into memory, they do not solve the problem of the large size of the page table
  - if several processes have a page table that contains a million entries, then much of RAM may become devoted to storing/managing multiple page tables
  - The page tables may be *sparse* many of the entries in the page table are just empty placeholders for logical pages that are not in memory, and may never be in memory
    - e.g. stack and heap may never grow large enough to use all of allocated memory





## Design and Analysis of Operating Systems CSCI 3753



Dr. David Knox
University of
Colorado Boulder

Material adapted from: Operating Systems: A Modern Perspective : Copyright © 2004 Pearson Education, Inc.